Empirical Risk Approximation: An Induction Principle for Unsupervised Learning

Author

  • Joachim M. Buhmann
Abstract

Unsupervised learning algorithms are designed to extract structure from data without reference to explicit teacher information. The quality of the learned structure is determined by a cost function which guides the learning process. This paper proposes Empirical Risk Approximation as a new induction principle for unsupervised learning. The complexity of the unsupervised learning models is automatically controlled by two conditions for learning: (i) the empirical risk of learning should uniformly converge towards the expected risk; (ii) the hypothesis class should retain a minimal variety for consistent inference. The maximum entropy principle with deterministic annealing as an efficient search strategy arises from the Empirical Risk Approximation principle as the optimal inference strategy for large learning problems. Parameter selection of learnable data structures is demonstrated for the case of k-means clustering.

1 What is unsupervised learning?

Learning algorithms are designed with the goal in mind that they should extract structure from data. Two classes of algorithms have been widely discussed in the literature: supervised and unsupervised learning. The distinction between the two classes relates to supervision or teacher information, which is either available to the learning algorithm or missing in the learning process. This paper presents a theory of unsupervised learning which has been developed in analogy to the highly successful statistical learning theory of classification and regression [Vapnik, 1982; Vapnik, 1995]. In supervised learning of classification boundaries or of regression functions, the learning algorithm is provided with example points and selects the best candidate function from a set of functions, called the hypothesis class. Statistical learning theory, developed by Vapnik and Chervonenkis in a series of seminal papers (see [Vapnik, 1982; Vapnik, 1995]), measures the amount of information in a data set which can be used to determine the parameters of the classification or regression models. Computational learning theory [Valiant, 1984] addresses computational problems of supervised learning in addition to the statistical constraints.

In this paper I propose a theoretical framework for unsupervised learning based on optimization of a quality functional for structures in data. The learning algorithm extracts an underlying structure from a sample data set under the guidance of a quality measure denoted as learning costs. The extracted structure of the data is encoded by a loss function, and it is assumed to produce a learning risk below a predefined risk threshold. This induction principle is referred to as Empirical Risk Approximation (ERA) and is summarized …
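To make conditions (i) and (ii) concrete, the following is a minimal sketch of the quantities involved, with notation assumed rather than quoted from the paper: the empirical risk averages a loss function over the sample, the expected risk integrates it over the unknown data distribution, and ERA retains every hypothesis whose empirical risk stays below a predefined threshold instead of keeping only the minimizer.

    % Sketch of the ERA quantities; h(x;\alpha) denotes the loss of
    % hypothesis \alpha on data point x, and P the (unknown) data distribution.
    \hat{R}(\alpha) = \frac{1}{n}\sum_{i=1}^{n} h(x_i;\alpha),
    \qquad
    R(\alpha) = \mathbb{E}_{P}\bigl[h(x;\alpha)\bigr] = \int h(x;\alpha)\,\mathrm{d}P(x).

    % ERA keeps the whole set of hypotheses whose empirical risk falls
    % below a predefined threshold R_\gamma:
    \mathcal{A}_\gamma = \bigl\{\alpha \in \Lambda : \hat{R}(\alpha) \le R_\gamma\bigr\}.

    % Condition (i) asks that \sup_{\alpha\in\Lambda}|\hat{R}(\alpha) - R(\alpha)|
    % vanish as the sample grows; condition (ii) asks that \mathcal{A}_\gamma
    % remain non-trivially large, so the hypothesis class keeps a minimal variety.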
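The abstract's claim that the maximum entropy principle with deterministic annealing emerges as the optimal inference strategy can be illustrated for k-means, the paper's worked example. The Python sketch below is a generic deterministic-annealing k-means loop (Gibbs/softmax assignments at temperature T, cooled geometrically), not the author's implementation; the function name, cooling schedule, and all parameters are illustrative assumptions.

    import numpy as np

    def da_kmeans(X, k, T_start=10.0, T_end=0.01, cooling=0.9, n_iter=50, seed=0):
        """Deterministic-annealing k-means: a generic sketch, not the paper's code.

        At temperature T, points receive soft (Gibbs) assignments to centroids;
        centroids are re-estimated as assignment-weighted means. Lowering T
        anneals the soft solution towards a hard k-means partition.
        """
        rng = np.random.default_rng(seed)
        n, _ = X.shape
        # Initialize centroids on randomly chosen data points.
        mu = X[rng.choice(n, size=k, replace=False)].copy()
        T = T_start
        while T > T_end:
            for _ in range(n_iter):
                # Squared distances to all centroids, shape (n, k).
                d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
                # Maximum-entropy (Gibbs) assignments at temperature T:
                # softmax over -d2 / T, stabilized by subtracting the row max.
                logits = -d2 / T
                logits -= logits.max(axis=1, keepdims=True)
                p = np.exp(logits)
                p /= p.sum(axis=1, keepdims=True)
                # Weighted centroid update (M-step of the annealed objective).
                w = p.sum(axis=0)                      # effective cluster masses
                mu = (p.T @ X) / np.maximum(w, 1e-12)[:, None]
            T *= cooling                               # geometric cooling schedule
        return mu, p.argmax(axis=1)

As T falls, the soft assignments harden and the procedure approaches an ordinary k-means fixed point; the temperature acts as a complexity control in the spirit of the conditions discussed above, since at high T only a few effective clusters are distinguishable.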

Related articles

Unsupervised Learning without Overfitting: Empirical Risk Approximation as an Induction Principle for Reliable Clustering

Unsupervised learning algorithms are designed to extract structure from data samples on the basis of a cost function for structures. For a reliable and robust inference process, the unsupervised learning algorithm has to guarantee that the extracted structures are typical for the data source. In particular, it has to reject all structures where the inference is dominated by the arbitrariness of...

Minimum Description Length Principle in Supervised Learning with Application to Lasso

The minimum description length (MDL) principle in supervised learning is studied. One of the most important theories for the MDL principle is Barron and Cover’s theory (BC theory), which gives a mathematical justification of the MDL principle. The original BC theory, however, can be applied to supervised learning only approximately and limitedly. Though Barron et al. recently succeeded in remov...

Empirical Risk Minimization for Probabilistic Grammars: Sample Complexity and Hardness of Learning

Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. They are used ubiquitously in computational linguistics. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of probabilistic grammars using the log-loss. We derive sample complexity bounds in this framework that apply both to the...

Presentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures

Automatic short answer grading (ASAG) is the automated process of assessing natural language answers using computational methods and machine learning algorithms. The development of large-scale smart education systems, on one hand, and the importance of assessment as a key factor in the learning process together with the challenges it faces, on the other hand, have significantly increased the need for ...

Publication date: 1998